
[codex] Optimize Vesuvius training runtime and validation #802

Open

giorgioangel wants to merge 3 commits into merge-ink-pipelines from codex/vesuvius-runtime-optimizations

Conversation

@giorgioangel
Member

What changed

  • optimized Vesuvius training/runtime behavior in the vesuvius stack only
  • kept distributed validation with coherent rank-0 W&B aggregation and rotating preview selection
  • kept validation GIFs enabled while improving preview patch/slice selection
  • moved active run outputs and local W&B data to /ephemeral
  • added NUMA pinning support for CUDA DDP ranks and inherited worker affinity
  • added a binary EDT (Euclidean distance transform) fast path for surface dilation while keeping scipy fallback semantics for non-binary cases
  • kept persistent derived-target caching for validation and memory-only caching for training
  • preserved training-time skeletonization after augmentation
  • added regression tests for preview selection and dilation behavior
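
The binary EDT fast path mentioned above can be sketched as follows; `dilate_surface` and its signature are illustrative assumptions, not the PR's actual function:

```python
import numpy as np
from scipy import ndimage

def dilate_surface(mask: np.ndarray, radius: float) -> np.ndarray:
    """Dilate a surface mask by `radius` voxels (illustrative sketch)."""
    values = np.unique(mask)
    if mask.dtype == bool or set(values.tolist()) <= {0, 1}:
        fg = mask.astype(bool)
        if not fg.any():
            return fg
        # Fast path: distance from every voxel to the nearest foreground
        # voxel, thresholded at the dilation radius. Equivalent to binary
        # dilation with a Euclidean ball of that radius, in one EDT pass.
        dist = ndimage.distance_transform_edt(~fg)
        return dist <= radius
    # Fallback: grey dilation keeps scipy semantics for non-binary labels.
    size = int(2 * radius) + 1
    return ndimage.grey_dilation(mask, size=(size,) * mask.ndim)
```

Thresholding the distance transform produces the same result as a ball-structuring-element binary dilation, but is typically much faster for large radii since the EDT cost does not grow with the radius.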

Why

  • the remote H100 canaries showed real validation wins from distributed validation and ps128 throughput gains from batch_size: 24
  • the remaining runtime issues were dominated by training/validation pipeline overhead and slow artifact I/O rather than connected-components metrics
  • the changes focus on the highest-likelihood real gains without staging operational artifacts or secrets

Validation

Ran on the remote repo under /home/ubuntu/villa/vesuvius:

  • .venv/bin/python3 -m py_compile on the touched Python files
  • PYTHONPATH=src .venv/bin/pytest tests/models/test_validation_preview.py tests/models/test_zarr_dataset_dilation.py
  • live remote canary profiling for ps128 and ps256
  • NUMA affinity verified on running rank processes via /proc/<pid>/status
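
The /proc/<pid>/status check in the last bullet can be scripted; this helper is an illustrative sketch, not part of the PR:

```python
def cpus_allowed(pid=None) -> str:
    """Return the Cpus_allowed_list value (e.g. '0-7' or '0,2,4')
    for a process, read from /proc/<pid>/status on Linux."""
    path = f"/proc/{pid if pid is not None else 'self'}/status"
    with open(path) as f:
        for line in f:
            if line.startswith("Cpus_allowed_list:"):
                return line.split(":", 1)[1].strip()
    raise RuntimeError(f"Cpus_allowed_list not found in {path}")
```

A NUMA-pinned DDP rank should report a CPU range local to its node rather than the full machine range, and its data-loader workers should inherit the same mask.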

Notes

  • PR intentionally excludes root-level operational artifacts and configs such as .patches_cache/, _codex_backup_20260331/, bench_edt_vs_scipy.py, and the root ps128_medial_default.yaml / ps256_medial_default.yaml
  • PR targets merge-ink-pipelines because the remote source branch is based on that branch rather than main

@vercel

vercel bot commented Mar 31, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project: scrollprize-org | Deployment: Ignored | Actions: Preview | Updated (UTC): Mar 31, 2026 10:15am


@giorgioangel giorgioangel marked this pull request as ready for review March 31, 2026 10:51
@giorgioangel giorgioangel requested a review from jrudolph as a code owner March 31, 2026 10:51
@giorgioangel giorgioangel requested a review from bruniss March 31, 2026 10:51

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9f706a32f


Comment on lines +494 to +495
# Deterministic skeleton targets are generated in the dataset before augmentation.
return None


P1: Restore validation skeleton transform for non-Zarr datasets

create_validation_transforms now always returns None, but MutexAffinityDataset._initialize_transforms still invokes this helper when skeleton losses are configured. In that path, validation no longer produces *_skel tensors, and BaseTrainer._compute_loss_value will call skeleton losses without the required skel argument, which raises for DC_SkelREC_and_CE_loss/SoftSkeletonRecallLoss. This breaks mutex-affinity validation whenever skeleton-supervised losses are enabled.
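
One shape the suggested fix could take is gating the early return on the dataset type; all names below (`SkeletonTransform`, `dataset_is_zarr`) are hypothetical stand-ins, not the repository's API:

```python
class SkeletonTransform:
    """Stand-in for a transform that derives *_skel tensors from targets.
    Hypothetical; the real class name in the PR is unknown."""
    def __call__(self, sample: dict) -> dict:
        sample = dict(sample)
        for key in [k for k in sample if k.startswith("target")]:
            # Placeholder skeletonization: real code would thin the mask.
            sample[f"{key}_skel"] = sample[key]
        return sample

def create_validation_transforms(skeleton_losses: bool, dataset_is_zarr: bool):
    """Return None only when skeleton targets are produced elsewhere."""
    if not skeleton_losses:
        return None
    if dataset_is_zarr:
        # Zarr datasets emit deterministic skeleton targets pre-augmentation.
        return None
    # Non-Zarr paths (e.g. mutex-affinity) still need *_skel at validation.
    return SkeletonTransform()
```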


Comment on lines +188 to +190
cache_key = self._cache_key(patch_info, target_key, ignore_value)
cached = self._cache_get(cache_key)
if cached is not None:


P1: Disable skeleton cache for augmented training samples

The new cache key is based only on static patch_info (volume/position/patch size), but this transform is appended after stochastic augmentations in the training pipeline. That means repeated patches can reuse a cached skeleton computed for a different augmented variant, so {target}_skel can diverge from the current target tensor and silently corrupt skeleton-supervised training on ZarrDataset runs.
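
The collision described here can be avoided with a content-based key; this is an illustrative sketch (the function name and `patch_info` shape are assumptions, not the PR's code):

```python
import hashlib
import numpy as np

def skeleton_cache_key(patch_info: dict, target: np.ndarray) -> str:
    """Cache key that distinguishes augmented variants of the same patch.

    Keying on patch_info alone (volume/position/patch size) would let two
    differently augmented copies of one patch share a skeleton; folding a
    digest of the post-augmentation target bytes into the key prevents that.
    """
    h = hashlib.sha1()
    h.update(repr(sorted(patch_info.items())).encode())
    h.update(np.ascontiguousarray(target).tobytes())
    return h.hexdigest()
```

The trade-off is that cache hits only occur when the exact augmented tensor repeats, so for heavily randomized training pipelines it may be simpler to disable the cache entirely, as the comment title suggests.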

